Week 3 of 12 · Part A — Applied Safety

Running the Operation

Locking in Week 3 — a responsible red-team operation you could describe end to end, and the lines you won't cross

Day 15 · ~50 minutes · Review

Day 15 of 60

What you now hold

Three weeks in, you can run a red-team as a discipline, not a stunt. You can frame it as a controlled, recorded, defensive operation; design a plan with coverage, success criteria, logging fields, and an escalation path; turn the log into a payload-free coverage-and-ASR (attack-success rate) report that surfaces both weak and untested categories; and scale it with automated red-teaming while naming exactly where automation stops. Through all of it, you treat the people doing the work as people to protect, not as a resource to expose to the worst material without limit.

The through-line of Week 3

Responsible red-teaming is the defensive practice of finding failures on purpose, under controlled conditions, in categories rather than recipes, measured by coverage and attack-success rate, scaled by automation, and bounded by ethics and the well-being of the red-teamers. The deliverable is a record that makes a model safer, never a kit that makes misuse easier.

The end-to-end operation, in one breath

The Operation

1 · Frame & plan (Days 11–12)

Set the defensive posture; define attack categories, per-category success criteria against your Week 2 policy, the logging fields, the escalation path, and the well-being protocol — all before a single attempt.

2 · Run & measure (Day 13)

Log every attempt as category + outcome + severity (raw detail sealed in a secured store), then compute per-category ASR and flag the untested categories. Report two risks: visible weakness (categories where attacks succeed) and invisible blind spots (categories never tested).

3 · Scale & bound (Day 14)

Extend coverage with automated red-teaming for breadth; keep success-definition, novel-failure hunting, and the bright lines in human hands; and state automation's limits out loud.

4 · Route & protect (throughout)

Send real findings down the escalation path to a fix; rotate red-teamers off heavy categories and enforce exposure limits. The operation makes the model safer and doesn't quietly harm its own people.
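Steps 1 and 2 above can be made concrete with a small sketch. It is not taken from the course material: the category names, the `Attempt` fields, the 0 to 4 severity scale, and the `coverage_report` helper are all hypothetical, chosen only to show how a payload-free log yields per-category ASR and a list of untested categories. Coverage is assumed here to mean the fraction of planned categories with at least one logged attempt.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical placeholder names; a real plan defines its own categories.
PLANNED_CATEGORIES = ["category_a", "category_b", "category_c", "category_d"]


@dataclass(frozen=True)
class Attempt:
    """One payload-free log row: no prompt or response text is stored here."""
    category: str
    succeeded: bool  # judged against that category's success criterion
    severity: int    # assumed scale: 0 (no harmful output) to 4 (critical)


def coverage_report(planned, log):
    """Return per-category ASR, the untested categories, and coverage."""
    if not planned:
        raise ValueError("the plan must list at least one category")
    unplanned = {a.category for a in log} - set(planned)
    if unplanned:
        raise ValueError(f"log contains unplanned categories: {sorted(unplanned)}")

    attempts = Counter(a.category for a in log)
    successes = Counter(a.category for a in log if a.succeeded)

    # ASR is only defined where at least one attempt was made.
    asr = {c: successes[c] / attempts[c] for c in planned if attempts[c] > 0}
    # An untested category has no ASR at all; it is reported separately,
    # never as 0.0, so that it cannot be mistaken for a pass.
    untested = [c for c in planned if attempts[c] == 0]
    coverage = (len(planned) - len(untested)) / len(planned)
    return {"asr": asr, "untested": untested, "coverage": coverage}


log = [
    Attempt("category_a", True, 3),
    Attempt("category_a", False, 0),
    Attempt("category_a", False, 0),
    Attempt("category_a", False, 0),
    Attempt("category_b", False, 0),
    Attempt("category_b", False, 0),
]
report = coverage_report(PLANNED_CATEGORIES, log)
print(report)
```

In this toy log, `category_a` has an ASR of 0.25, `category_b` has an ASR of 0.0, and `category_c` and `category_d` appear only in the untested list. Keeping untested categories out of the ASR table is the design choice that matters: a missing number prompts a question, while a zero would read as a clean result.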

The ethical line is part of the skill

The hardest part of running a red-team is knowing what you won't do. You probe for weakness without producing operational misuse content; you store categories, not recipes; you protect real people's privacy; and you refuse to let "we're red-teaming" become a license to generate the very harms you're meant to defend against. Being able to name that line crisply is a mark of seniority.

Self-quiz — can you do these without notes?

Prove the Week


Describe a responsible red-team operation end to end — frame, plan, run, measure, scale, route — in your own words, using Ganguli et al. as your reference for the defensive posture and red-teamer well-being.
Define coverage and attack-success rate, and explain why an untested category is a risk, not a pass.
Explain how a model red-teams another model (Perez et al.) and name two limits of the automated half; reference how HarmBench standardizes attack-success measurement.
From memory, list the four parts of a red-team plan and the elements of a well-being protocol a new lead could follow.
Write the Week 3 summary in your own words (a paragraph), and the single hardest ethical line you would refuse to cross while red-teaming — and why.

The expert move

A practitioner can break a model. An expert can run the whole operation responsibly — plan it, measure it without storing payloads, scale it, route the findings, protect the people — and can state the lines they won't cross with the same precision they bring to the metrics. The altitude jump is from "I can find failures" to "I can run a defensible, ethical, measurable red-team program that a team and a regulator would both trust."

Say this in an interview: "I run red-teaming as a defensive operation: a plan with coverage and success criteria, a payload-free log that yields per-category ASR and flags blind spots, automation for breadth with humans on judgment, an escalation path for real finds, and a well-being protocol for the red-teamers. And I can tell you exactly which lines I won't cross — because knowing them is part of the job."

Week 3 Takeaways

Responsible red-teaming is a defensive operation: categories not recipes, measured by coverage + ASR, bounded by ethics.
The four-part plan and the payload-free coverage report turn poking into a measurable, auditable program.
Automation multiplies breadth; humans keep judgment, success-definition, and the bright lines.
Next week: turn found failures into repeatable safety evaluations that measure both harmful compliance and over-refusal.